📖 Understand the content matter
As a statistician, I collaborate frequently with subject-matter experts to ensure that I understand the context of the problem at hand.
❓ Understand the objective
It is crucial to understand what the objectives are. Ideally these are set a priori; if the analyses are exploratory, that should be made explicit from beginning to end.
📏 Understand where the data came from
Was this observational or experimental data? Is any data missing? What are the units? Are there data entry issues?
🧹 Get the data into a tidy, analyzable form
Often we get data in a form that is not easily analyzable. In this class, we will be focusing mostly on statistical methodology once the data is in an analyzable format, but just because it is analyzable doesn’t mean the analysis choice is obvious.
💃 Determine the appropriate model
In this class we are focusing on Linear Models. Linear models are not always appropriate. You must examine your data to determine whether a linear model is a good choice.
Standard form: \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]
Where: \(\mathbf{y}\) is the \(n \times 1\) vector of responses, \(\mathbf{X}\) is the design matrix, \(\boldsymbol{\beta}\) is the vector of unknown coefficients, and \(\boldsymbol{\varepsilon}\) is the \(n \times 1\) vector of random errors.
Simple linear regression: \[\mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}\]
Multiple regression: \[\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}\]
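As a quick sketch (with hypothetical predictor values), the simple linear regression design matrix above can be built in R by binding an intercept column to the predictor:

```r
x <- c(1, 3, 5)        # hypothetical predictor values
X <- cbind(1, x)       # intercept column of 1s, then the predictor
X                      # 3 x 2 design matrix: one column each for beta0 and beta1
```

The formula interface `model.matrix(~ x)` produces the same matrix, including the intercept column, automatically.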
Hat notation indicates estimates or predicted values
Parameters vs. Estimates: \(\boldsymbol{\beta}\) denotes the true (unknown) parameters; \(\hat{\boldsymbol{\beta}}\) denotes their estimates from the data
Observed vs. Predicted: \(y_i\) is an observed response; \(\hat{y}_i\) is the model's predicted value for that observation
Definition: A residual is the difference between an observed value and its predicted value from the model
Formula:
\[\hat\varepsilon_i = y_i-\hat{y}_i\]
Problem: Given the regression equation \(\hat{y} = 2.3 + 1.5x\) and the data points below, calculate the residual for each observation:
| x | y | \(\hat{y}\) | \(\hat{\varepsilon}\) |
|---|---|---|---|
| 1 | 4.2 | ? | ? |
| 3 | 6.8 | ? | ? |
| 5 | 11.1 | ? | ? |
Problem: Given the same regression equation \(\hat{y} = 2.3 + 1.5x\) and \(y = [4.2, 6.8, 11.1]\), \(x = [1, 3, 5]\):
Task: In R, store the design matrix in a matrix called X and the outcome in a vector called y, then use \(\boldsymbol{\beta}\) to compute the residuals \(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\)
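A minimal sketch of that computation in R, using the equation and data from the problem above:

```r
x <- c(1, 3, 5)
y <- c(4.2, 6.8, 11.1)
beta <- c(2.3, 1.5)     # intercept and slope from y-hat = 2.3 + 1.5x
X <- cbind(1, x)        # design matrix: intercept column plus x
y_hat <- X %*% beta     # fitted values, one per observation
resid <- y - y_hat      # residuals: observed minus predicted
```

Note that `y_hat` and `resid` come back as \(3 \times 1\) matrices; wrap them in `drop()` or `as.numeric()` if you want plain vectors.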
Sum of squared errors:
\[\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \hat\varepsilon_i^2\]
\[\text{SSE} = (\mathbf{y}- \mathbf{X}\boldsymbol\beta)^T(\mathbf{y}- \mathbf{X}\boldsymbol\beta)\]
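The two SSE expressions can be checked against each other numerically; a sketch using the same data and coefficients as the problem above:

```r
x <- c(1, 3, 5)
y <- c(4.2, 6.8, 11.1)
X <- cbind(1, x)
beta <- c(2.3, 1.5)
e <- y - X %*% beta              # residual vector
sse_sum    <- sum(e^2)           # summation form: sum of squared residuals
sse_matrix <- drop(t(e) %*% e)   # matrix form: (y - X beta)^T (y - X beta)
all.equal(sse_sum, sse_matrix)   # TRUE: the two forms agree
```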
Problem: Verify that these two expressions for SSE are the same:
Individual terms: \[\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
Matrix form: \[\text{SSE} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\]
Task: Expand the matrix form and show it equals the summation form
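One way to start: write \(\mathbf{e} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}\), whose \(i\)-th entry is \(y_i - \hat{y}_i\). Then the matrix form is an inner product of \(\mathbf{e}\) with itself:

\[(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{e}^T\mathbf{e} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]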
The regression equation: \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]
The goal: Find \(\boldsymbol{\beta}\) that makes \(\mathbf{X}\boldsymbol{\beta}\) as close as possible to \(\mathbf{y}\)
If we have \(n\) observations, we work in \(n\)-dimensional space
\(\mathbf{y}\) is a vector with \(n\) components (one value per observation)
\(\mathbf{X}\boldsymbol{\beta}\) is also a vector with \(n\) components (one prediction per observation)
Both vectors live in the same \(n\)-dimensional space
In words: the column space of \(\mathbf{X}\) is the set of all possible predictions your model can make
Think of it as: every linear combination of the columns of your design matrix
Your data: \(x = [1, 2, 3]\) with 3 observations
Design matrix: \(\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}\)
Column space contains: All vectors of the form \(\beta_0 \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} + \beta_1 \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}\)
Example 1: Choose \(\beta_0=0, \beta_1=2\) gives \([2, 4, 6]\)
In words: Intercept = 0, slope = 2, so predictions are [2, 4, 6]
Example 2: Choose \(\beta_0=5, \beta_1=0\) gives \([5, 5, 5]\)
In words: Intercept = 5, slope = 0, so all predictions equal 5
Your actual observed data: \(\mathbf{y} = [2.1, 3.9, 5.8]\)
Question: Is there some \(\beta_0, \beta_1\) such that \(\mathbf{X}\boldsymbol{\beta} = \mathbf{y}\) exactly?
In other words: Is \([2.1, 3.9, 5.8]\) exactly equal to \(\beta_0[1,1,1] + \beta_1[1,2,3]\)?
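A sketch in R of answering that question numerically: find the least squares \(\hat{\boldsymbol{\beta}}\) and see whether the residuals are exactly zero (data from the example above).

```r
X <- cbind(1, c(1, 2, 3))                  # design matrix for x = [1, 2, 3]
y <- c(2.1, 3.9, 5.8)                      # observed data
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # closest beta in the least squares sense
resid <- y - X %*% beta_hat
all(abs(resid) < 1e-10)                    # FALSE: y is not exactly in the column space
```

Because the residuals are nonzero, no choice of \(\beta_0, \beta_1\) reproduces \(\mathbf{y}\) exactly; the best we can do is the closest point in the column space.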
If \(\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p]\), then:
\(\mathbf{x}_1^T\hat\varepsilon = 0\), \(\mathbf{x}_2^T\hat\varepsilon = 0\), …, \(\mathbf{x}_p^T\hat\varepsilon = 0\)
Stacking these equations gives us: \(\mathbf{X}^T\hat\varepsilon = \mathbf{0}\)
Starting from: \(\mathbf{X}^T\hat\varepsilon = \mathbf{0}\)
Substitute \(\hat\varepsilon = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\):
\[\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}\]
In words: “The residuals are perpendicular to every column of \(\mathbf{X}\)”
Starting with: \(\mathbf{A}\mathbf{x} + \mathbf{b} = \mathbf{c}\)
Solve for \(\mathbf{x}\) step by step:
Start with: \(\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}\)
Distribute: \(\mathbf{X}^T\mathbf{y} - \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0}\)
Move the second term to the right side: \(\mathbf{X}^T\mathbf{y} = \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}\)
We have: \(\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}\)
To solve for \(\hat{\boldsymbol{\beta}}\), multiply both sides by \((\mathbf{X}^T\mathbf{X})^{-1}\) (this inverse exists when \(\mathbf{X}\) has full column rank):
Result: \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)
This is the least squares solution!
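A sketch verifying the closed form against R's built-in `lm()`, using simulated (hypothetical) data:

```r
set.seed(1)
x <- 1:10
y <- 2.3 + 1.5 * x + rnorm(10)                 # simulated responses
X <- cbind(1, x)                               # design matrix
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X^T X)^{-1} X^T y
fit <- lm(y ~ x)                               # R's least squares fit
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))  # TRUE: same estimates
```

In practice `lm()` (or `solve(t(X) %*% X, t(X) %*% y)`) is preferred over forming the explicit inverse, which is numerically less stable, but the formula is the same.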
Definition: \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\)
What it does: \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\)
In words: Takes your observed data and produces the closest possible predictions
Nickname: “Puts the hat on \(\mathbf{y}\)” to get \(\hat{\mathbf{y}}\)
Symmetric: \(\mathbf{H}^T = \mathbf{H}\)
Idempotent: \(\mathbf{H}^2 = \mathbf{H}\)
Verify that: \(\mathbf{H}^2 = \mathbf{H}\)
Hint: Substitute the definition \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) and multiply it out
Remember: \((\mathbf{A}^{-1})^{-1} = \mathbf{A}\)
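For reference, the algebra works out in one line: the \((\mathbf{X}^T\mathbf{X})^{-1}\) and \((\mathbf{X}^T\mathbf{X})\) in the middle cancel to the identity.

\[\mathbf{H}^2 = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\,\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\left[(\mathbf{X}^T\mathbf{X})(\mathbf{X}^T\mathbf{X})^{-1}\right]\mathbf{X}^T = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T = \mathbf{H}\]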
We minimize: \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\)
In words: Sum of squared differences between observed and predicted
This equals: \(||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||^2\)
In words: Squared distance between the observed vector and prediction vector
1. Regression is projection: Finding the closest point in the column space to \(\mathbf{y}\)
2. Orthogonality is key: Residuals perpendicular to column space guarantees minimum distance
3. Hat matrix is the projection operator: \(\mathbf{H}\) projects onto column space
Try this in R:

```r
x <- 1:10; y <- 2*x + rnorm(10)
X <- cbind(1, x)                       # Design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)  # Hat matrix
all.equal(H %*% H, H)                  # Check: verify idempotent (TRUE)
e <- y - H %*% y                       # Residuals
t(X) %*% e                             # Check: should be approximately zero
```
Least squares isn’t just algebra; it’s geometry
We’re finding the best approximation to our data within our model’s constraints
The mathematical formulas follow naturally from geometric principles